
API: DataFrame.to_csv formatting parameters for float indexes #11681


Merged
merged 1 commit into from
Dec 27, 2015


nbonnotte
Contributor

Fix issue #11553

Two things:

  1. I've created a Float64Index._format_native_types method, which is a copy-paste of FloatBlock.to_native_types. I would have preferred to call the latter directly, but I'm not sure what the placement parameter of the FloatBlock constructor means. I guess it doesn't really matter, since I could pass any value and it should work (I think), so my hesitation is probably a bit unfounded, but I don't know whether that would really be a clean solution. Maybe someone can think of a more elegant way?

  2. Since a Float64Index containing only NaNs collapses when it is part of a multi-index, its NaN values would not be converted using na_rep, so I had to hack a solution. I put a comment in the relevant part. I'm not quite convinced of the elegance of the solution myself, though.

What do you think?

@nbonnotte
Contributor Author

And I've corrected the fact that the decimal option was not taken into account for 0.0, by replacing the use of %g formatting with a call to str
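
A minimal pure-Python sketch of the behaviour described in this comment, not the pandas implementation. The helper name `format_floats` is hypothetical. It assumes the problem came from `%g` rendering 0.0 as `'0'`, which leaves no decimal point to replace, whereas `str` yields `'0.0'`.

```python
import math


def format_floats(values, na_rep="", decimal="."):
    """Hypothetical helper: render floats as strings for CSV output."""
    out = []
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            out.append(na_rep)  # missing values get the placeholder
            continue
        text = str(float(v))  # str() keeps the fractional part: 0.0 -> '0.0'
        if decimal != ".":
            text = text.replace(".", decimal)
        out.append(text)
    return out


# '%g' drops the fractional part of 0.0, so there is nothing to replace.
g_style = "%g" % 0.0
print(g_style)  # 0

formatted = format_floats([0.0, 1.5, float("nan")], na_rep="_", decimal="^")
print(formatted)  # ['0^0', '1^5', '_']
```

Because every non-missing value goes through the same `str` path, 0.0 is handled like any other number.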

@jreback added the Output-Formatting, API Design, and Indexing labels Nov 24, 2015
# if any index contains only NaNs, it has collapsed into an empty
# Float64Index, and when the multiindex has been recomposed
# the NaNs have come back as NaNs, not as strings corresponding to
# na_rep
Contributor

this is non-performant. What exactly is the issue here?

Contributor Author

Hmm, I realize my comment is not only unclear but also inaccurate.

Let's consider the example in my tests:

df = DataFrame({'a': [0, np.NaN], 'b': [0, 1], 'c': [2, 3]}).set_index(['a', 'b'])

The index is a multi-index, with the following levels:

Float64Index([0.0], dtype='float64', name=u'a')
Int64Index([0, 1], dtype='int64', name=u'b')

After the calls to _format_native_types, the variable levels contains:

['0.0']
['0', '1']

and afterwards mi.values contains:

[('0.0', '0'), (nan, '1')]

Here, nan is not a string "nan", but numpy.NaN (which is printed as the string nan)

Then, for that reason, in the tests I've introduced, the following test fails:

Traceback (most recent call last):
  File "/home/nicolas/Git/pandas/pandas/tests/test_format.py", line 2965, in test_to_csv_na_rep
    self.assertEqual(df.set_index(['a', 'b']).to_csv(na_rep='_'), expected)
AssertionError: 'a,b,c\n0.0,0,2\nnan,1,3\n' != 'a,b,c\n0.0,0,2\n_,1,3\n'
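
The failure above can be reproduced without pandas. The sketch below is a simplified model, not the real MultiIndex: `levels`, `labels`, and `rebuild` are hypothetical stand-ins, under the assumption that each level stores only its non-NaN values and that a label of -1 marks a missing entry.

```python
nan = float("nan")

# Hypothetical stand-in for a two-level MultiIndex: each level stores its
# unique non-NaN values, and labels hold positions into it (-1 = missing).
levels = [[0.0], [0, 1]]
labels = [[0, -1], [0, 1]]


def rebuild(levels, labels):
    """Recompose row tuples; a -1 label always comes back as NaN."""
    rows = []
    for row_labels in zip(*labels):
        rows.append(tuple(nan if pos == -1 else level[pos]
                          for level, pos in zip(levels, row_labels)))
    return rows


# Formatting only the level values never touches the -1 entries.
str_levels = [[str(v) for v in level] for level in levels]
rows = rebuild(str_levels, labels)
print(rows)  # [('0.0', '0'), (nan, '1')]
```

The NaN in the second row is still a float, so `na_rep` never gets a chance to replace it.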

Contributor

So the way to fix this is to set both the levels and the labels; e.g. you append a new value to the level (which will represent the NaNs), then change the -1 in the labels to the position of that value (1 in this case). This will then reformat the MultiIndex to work correctly.

You can do this in the MultiIndex constructor of _format_native_types

In [20]: df.index.set_levels([0.0,'_'],level=0).set_labels([0,1],level=0).values
Out[20]: array([(0.0, 0), ('_', 1)], dtype=object)
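
A pure-Python sketch of the suggestion above, using plain lists in place of a real MultiIndex. The helper name `fill_missing` is hypothetical. It assumes, as in the discussion, that a label of -1 marks a missing value.

```python
def fill_missing(level, level_labels, na_rep):
    """Append na_rep to a level and point every -1 label at it."""
    if -1 not in level_labels:
        return list(level), list(level_labels)
    new_level = list(level) + [na_rep]
    na_pos = len(new_level) - 1  # position of the appended placeholder
    new_labels = [na_pos if pos == -1 else pos for pos in level_labels]
    return new_level, new_labels


level, level_labels = fill_missing(["0.0"], [0, -1], na_rep="_")
print(level, level_labels)  # ['0.0', '_'] [0, 1]

# Looking the labels up in the level now yields the placeholder.
values = [level[pos] for pos in level_labels]
print(values)  # ['0.0', '_']
```

The work is one pass over the labels of each level, with no need to patch the recomposed row tuples afterwards.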

@nbonnotte
Contributor Author

@jreback all set

@@ -3878,6 +3878,32 @@ def _convert_slice_indexer(self, key, kind=None):
        # translate to locations
        return self.slice_indexer(key.start, key.stop, key.step)

    def _format_native_types(self, na_rep='', float_format=None,
                             decimal='.', quoting=None, **kwargs):
Contributor

since this is basically identical to core/internals/FloatBlock/to_native_types, let's pull both of those out, put the logic in a method on core/format.py/FloatArrayFormatter, and call it .get_formatted_data(). See how that works out. The existing routines there are for screen printing (which is necessarily different from to_csv / index formatting).
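
The shape of that refactor can be sketched as follows. This is an illustration only: `FloatFormatterSketch`, `IndexSketch`, and `BlockSketch` are hypothetical stand-ins rather than the pandas classes, and the formatting rules are simplified assumptions (a `%`-style `float_format`, no quoting support).

```python
import math


class FloatFormatterSketch:
    """Hypothetical stand-in for the shared formatter; not pandas code."""

    def __init__(self, values, na_rep="", float_format=None, decimal="."):
        self.values = values
        self.na_rep = na_rep
        self.float_format = float_format
        self.decimal = decimal

    def get_formatted_data(self):
        out = []
        for v in self.values:
            if math.isnan(v):
                out.append(self.na_rep)
                continue
            text = self.float_format % v if self.float_format else str(v)
            out.append(text.replace(".", self.decimal))
        return out


class IndexSketch:
    def __init__(self, values):
        self.values = values

    def _format_native_types(self, **kwargs):
        # Delegate instead of duplicating the formatting logic.
        return FloatFormatterSketch(self.values, **kwargs).get_formatted_data()


class BlockSketch:
    def __init__(self, values):
        self.values = values

    def to_native_types(self, **kwargs):
        # Same delegation, so both callers stay in sync.
        return FloatFormatterSketch(self.values, **kwargs).get_formatted_data()


data = [0.0, 2.5, float("nan")]
index_out = IndexSketch(data)._format_native_types(na_rep="_", decimal=",")
block_out = BlockSketch(data).to_native_types(na_rep="_", decimal=",")
print(index_out)  # ['0,0', '2,5', '_']
```

With a single formatter, a fix such as the `decimal` handling only has to be made in one place.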

@jreback added this to the 0.18.0 milestone Nov 29, 2015
@jreback
Contributor

jreback commented Nov 29, 2015

ok, looks pretty good. I think we can take this opportunity to re-factor a bit as I have noted above.

@jreback
Contributor

jreback commented Dec 6, 2015

@nbonnotte if you can update / refactor as above would be great

@nbonnotte
Contributor Author

I will, I just haven't had the time yet. If you'd like this to be done quickly because of the schedule for the 0.18.0 release, let me know.

@jreback
Contributor

jreback commented Dec 7, 2015

no, just pinging :)

@jreback
Contributor

jreback commented Dec 16, 2015

lmk when you can update

@nbonnotte
Contributor Author

@jreback I just pushed the changes. Let me know if other changes are needed.

Happy Holidays!

expected = "a,b,c\n_,0,2\n_,1,3\n"
self.assertEqual(df.set_index('a').to_csv(na_rep='_'), expected)
self.assertEqual(df.set_index(['a', 'b']).to_csv(na_rep='_'), expected)
# check if na_rep parameter does not break anything when no NaN
Contributor

blank line here

@jreback
Contributor

jreback commented Dec 26, 2015

lgtm. some very minor formatting changes. in general like to have blank lines between different sub-tests and to format code nicely.

ping when pushed and green.

@@ -2101,6 +2105,32 @@ def _format_strings(self):

        return fmt_values

    def get_formatted_data(self):
        values = self.values
Contributor

add a doc-string here describing what this is doing. Some comments in the code as well (to describe the purposes of the if clauses)

@nbonnotte
Contributor Author

If you and Travis agree, that should be it.

I have added some comments as you suggested, to make the code clearer. But to be honest, I'm not quite sure how some bits of the code fit with the rest, for instance how and where the quoting parameter is handled. I mostly moved stuff around, and it's difficult to explain things in comments when one does not understand what is going on. I hope the comments I wrote are OK, though.

Also, I suppose FloatArrayFormatter could be refactored a bit, because right now it looks more like a potpourri, a mixture of different bits of code that are related but do not have much code in common. But I'm not sure it is worth the effort.


# same but for an index
self.assertEqual(
df.set_index('a').to_csv(decimal='^'), expected)
Contributor

ok, are there tests for quoting? if not can you add a couple. thxs.

@jreback
Contributor

jreback commented Dec 27, 2015

@nbonnotte code looks great. just see if we have some tests for quoting if not, pls add them.

the explanations are fine. Just want to have a note instead of a mass of code to explain a bit what is happening.

as far as refactoring, you got what I was looking for (integration between Index and data formatting). if you see additional things, don't hesitate with another PR!

ping when pushed / green (or if tests ok, lmk)

@nbonnotte
Contributor Author

Yeah, the quoting parameter is used in pandas/tests/test_format.py, at least in test_to_csv_quotechar(), test_to_csv_doublequote(), and test_to_csv_escapechar().
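
For context on why quoting interacts with float formatting at all, here is a small sketch using only Python's standard csv module (not pandas). It shows that under `QUOTE_NONNUMERIC` a value that has already been converted to a string is quoted, while the same value left as a float is not. That this is the reason the formatting code takes a quoting argument is an assumption, not something stated in this thread.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")

# The same number, once as a float and once as an already-formatted string.
writer.writerow([1.5, "1.5"])

line = buf.getvalue()
print(repr(line))  # '1.5,"1.5"\n'
```

A formatter that eagerly turns every float into a string would therefore change the output whenever non-numeric quoting is requested.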

jreback added a commit that referenced this pull request Dec 27, 2015
API: DataFrame.to_csv formatting parameters for float indexes
@jreback merged commit f295c0a into pandas-dev:master Dec 27, 2015
@jreback
Contributor

jreback commented Dec 27, 2015

thanks!
